The 2019-2020 Coronavirus Pandemic Analysis

Contact: Smith Research

BACKGROUND & APPROACH

I wanted to track and trend the coronavirus outbreak out of my own curiosity. Some interesting questions may fall out of this, as it is a historic moment both scientifically and analytically: a large amount of data is being shared across the globe and analyzed in real time. The world has come to a halt because of it.
This analysis attempts to answer the following questions (more to come):

  1. What does the trend of the pandemic look like to date?
  2. What are future case predictions based on a historical model?
  3. What interesting quirks or patterns emerge?

ASSUMPTIONS & LIMITATIONS:
* This data is limited by the source. I realized early on that the number of cases conflicted depending on the source. Originally I was using JHU data, but it was always 'ahead' of Our World in Data. I also noticed that JHU's website was buggy: you clicked on the U.S. stats but it didn't reflect the U.S. So I changed data sources to be more consistent with what is presented in the media (and Our World in Data has more extensive plots I can compare my own to). An interesting aside might be: why the discrepancy? Was I missing something?
* Definitions are important, as is the idea that multiple variables accumulate in quantities like total cases (more testing, for example).

SOURCE RAW DATA: * https://ourworldindata.org/coronavirus
* https://github.com/CSSEGISandData/COVID-19/

INPUT DATA LOCATION: github (https://github.com/sbs87/coronavirus/tree/master/data)

OUTPUT DATA LOCATION: github (https://github.com/sbs87/coronavirus/tree/master/results)

TIMESTAMP

Start: ##------ Tue Jun 16 21:20:00 2020 ------##

PRE-ANALYSIS

The following sections are outside the scope of the 'analysis' itself but are still needed to prepare everything.

UPSTREAM PROCESSING/ANALYSIS

  1. Google Mobility Scraping, script available at get_google_mobility.py
# Mobility data has to be extracted from Google PDF reports using a web-scraping script (Python, written by Peter Simone, https://github.com/petersim1/MIT_COVID19)

# See get_google_mobility.py for local script 

python3 get_google_mobility.py
# writes csv file of mobility data as "mobility.csv"

SET UP ENVIRONMENT

Load libraries and set global variables

# timestamp start
timestamp()
## ##------ Tue Jun 16 21:20:00 2020 ------##

# clear previous environment
rm(list = ls())

##------------------------------------------
## LIBRARIES
##------------------------------------------
library(plyr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0     ✓ purrr   0.3.3
## ✓ tibble  3.0.0     ✓ dplyr   0.8.5
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::arrange()   masks plyr::arrange()
## x purrr::compact()   masks plyr::compact()
## x dplyr::count()     masks plyr::count()
## x dplyr::failwith()  masks plyr::failwith()
## x dplyr::filter()    masks stats::filter()
## x dplyr::id()        masks plyr::id()
## x dplyr::lag()       masks stats::lag()
## x dplyr::mutate()    masks plyr::mutate()
## x dplyr::rename()    masks plyr::rename()
## x dplyr::summarise() masks plyr::summarise()
## x dplyr::summarize() masks plyr::summarize()
library(ggplot2)
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(plot.utils)
library(utils)
library(knitr)

##------------------------------------------

##------------------------------------------
# GLOBAL VARIABLES
##------------------------------------------
user_name <- Sys.info()["user"]
working_dir <- paste0("/Users/", user_name, "/Projects/coronavirus/")  # don't forget trailing /
results_dir <- paste0(working_dir, "results/")  # assumes directory exists
results_dir_custom <- paste0(results_dir, "custom/")  # assumes directory exists


Corona_Cases.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
Corona_Cases.US.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv"
Corona_Deaths.US.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv"
Corona_Deaths.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv"

Corona_Cases.fn <- paste0(working_dir, "data/", basename(Corona_Cases.source_url))
Corona_Cases.US.fn <- paste0(working_dir, "data/", basename(Corona_Cases.US.source_url))
Corona_Deaths.fn <- paste0(working_dir, "data/", basename(Corona_Deaths.source_url))
Corona_Deaths.US.fn <- paste0(working_dir, "data/", basename(Corona_Deaths.US.source_url))
default_theme <- theme_bw() + theme(text = element_text(size = 14))  # fix this
##------------------------------------------

FUNCTIONS

List of functions

function_name      description
prediction_model   outputs case estimates for given log-linear model parameters (slope and intercept)
make_long          converts input data to long format (specialized cases)
name_overlaps      outputs the column-name intersection and set differences of two data frames
find_linear_index  finds the first date at which linearity occurs
##------------------------------------------
## FUNCTION: prediction_model
##------------------------------------------
## --- //// ----
# Takes days vs log10 (case) linear model parameters and a set of days since 100 cases and outputs a dataframe with total number of predicted cases for those days
## --- //// ----
prediction_model<-function(m=1,b=0,days=1){
  total_cases.log<-m*days+b        # model was fit on log10(cases)
  total_cases<-10^total_cases.log  # back-transform to raw case counts
  prediction<-data.frame(days=days,Total_confirmed_cases_perstate=total_cases)
  return(prediction)
}
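Assuming the fitted model is log10(cases) = m*days + b, predicted counts come from back-transforming with 10^(m*days + b). A standalone sketch with a made-up slope and intercept (the function name and values here are illustrative only):

```r
# Hypothetical log-linear prediction: log10(cases) = m*days + b
predict_cases <- function(m, b, days) {
  data.frame(days = days, predicted_cases = 10^(m * days + b))
}

p <- predict_cases(m = 0.1, b = 2, days = c(0, 10, 20))
p$predicted_cases  # 100, 1000, 10000: one order of magnitude every 10 days
```

With a slope of 0.1 log10 units per day, cases grow tenfold every ten days, which is why the log scale is the natural axis for early-epidemic trends.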
##------------------------------------------

##------------------------------------------
## FUNCTION: make_long
##------------------------------------------
## --- //// ----
# Takes wide-format case data and converts it into long format, using date and total cases as the variable/value pair. Also enforces standardized data-structure naming via fixed variable name, value name, and id.vars
## --- //// ----
make_long<-function(data_in,variable.name = "Date",
                   value.name = "Total_confirmed_cases",
                   id.vars=c("case_type","Province.State","Country.Region","Lat","Long","City","Population")){

long_data<-melt(data_in,
                id.vars = id.vars,
                variable.name=variable.name,
                value.name=value.name)
return(long_data)

}
##------------------------------------------

## THIS WILL BE IN UTILS AT SOME POINT
name_overlaps<-function(df1,df2){
i<-intersect(names(df1),
names(df2))
sd1<-setdiff(names(df1),
names(df2))
sd2<-setdiff(names(df2),names(df1))
cat("intersection:\n",paste(i,"\n"))
cat("in df1 but not df2:\n",paste(sd1,"\n"))
cat("in df2 but not df1:\n",paste(sd2,"\n"))
return(list("int"=i,"sd_1_2"=sd1,"sd_2_1"=sd2))
}
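For example, on two toy frames (using a simplified copy of the helper without the cat printing, so the snippet stands alone):

```r
# Minimal copy of name_overlaps: shared and frame-specific column names
name_overlaps_min <- function(df1, df2) {
  list(int    = intersect(names(df1), names(df2)),
       sd_1_2 = setdiff(names(df1), names(df2)),
       sd_2_1 = setdiff(names(df2), names(df1)))
}

df1 <- data.frame(a = 1, b = 2, c = 3)
df2 <- data.frame(b = 4, c = 5, d = 6)
ov  <- name_overlaps_min(df1, df2)
ov$int     # "b" "c" -- columns the frames share
ov$sd_1_2  # "a"     -- present only in df1
```

This is the kind of check that is handy before the rbind/merge steps later in the processing.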

##------------------------------------------

##------------------------------------------
## FUNCTION: find_linear_index
##------------------------------------------
## --- //// ----
# Find date at which total case data is linear (for a given data frame) 
## --- //// ----

find_linear_index<-function(tmp,running_avg=5){
  tmp$Total_confirmed_cases_perstate.log<-log(tmp$Total_confirmed_cases_perstate,2)
  derivative<-data.frame(matrix(nrow = nrow(tmp),ncol = 4))
  names(derivative)<-c("m.time","mm.time","cumsum","date")
  
  # First derivative
  for(t in 2:nrow(tmp)){
    slope.t<- tmp[t,"Total_confirmed_cases_perstate.log"]- tmp[t-1,"Total_confirmed_cases_perstate.log"]
    derivative[t,"m.time"]<-slope.t
    derivative[t,"date"]<-as.Date(tmp[t,"Date"])
  }
  
  # Second derivative
  for(t in 2:nrow(derivative)){
    slope.t<- derivative[t,"m.time"]- derivative[t-1,"m.time"]
    derivative[t,"mm.time"]<-slope.t
  }
  
  # Count how many of the trailing `running_avg` second-derivative values are within 0.2 of zero (stability over the window)
  for(t in running_avg:nrow(derivative)){
    n_stable<- sum(abs(derivative[t:(t-running_avg+1),"mm.time"])<0.2,na.rm = T)
    derivative[t,"cumsum"]<-n_stable
  }
  
  #Find the date running_avg days before the stability point
  linear_begin<-min(derivative[!is.na(derivative$cumsum) & derivative$cumsum==running_avg,"date"])-running_avg
  
  return(linear_begin)
}
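The idea behind the stability test can be seen on a toy series (synthetic counts, not the real data): steady exponential growth is a straight line on a log2 scale, so its first difference is constant and its second difference sits near zero, which is what the 0.2 threshold detects.

```r
# Toy illustration of the second-difference stability test
cases      <- c(2, 4, 8, 16, 32, 64, 128)  # doubles every day
log2_cases <- log(cases, 2)                # 1, 2, 3, ..., 7
d1 <- diff(log2_cases)                     # first difference: all 1
d2 <- diff(d1)                             # second difference: all 0
all(abs(d2) < 0.2)                         # TRUE -> log-linear growth
```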

READ IN DATA

# Q: do we want to archive previous versions? Maybe an auto git mv?

##------------------------------------------
## Download and read in latest data from github
##------------------------------------------
download.file(Corona_Cases.source_url, destfile = Corona_Cases.fn)
Corona_Totals.raw <- read.csv(Corona_Cases.fn, header = T, stringsAsFactors = F)

download.file(Corona_Cases.US.source_url, destfile = Corona_Cases.US.fn)
Corona_Totals.US.raw <- read.csv(Corona_Cases.US.fn, header = T, stringsAsFactors = F)

download.file(Corona_Deaths.source_url, destfile = Corona_Deaths.fn)
Corona_Deaths.raw <- read.csv(Corona_Deaths.fn, header = T, stringsAsFactors = F)

download.file(Corona_Deaths.US.source_url, destfile = Corona_Deaths.US.fn)
Corona_Deaths.US.raw <- read.csv(Corona_Deaths.US.fn, header = T, stringsAsFactors = F)

# latest date on all data:
paste("US deaths:", names(Corona_Deaths.US.raw)[ncol(Corona_Deaths.US.raw)])
## [1] "US deaths: X6.15.20"
paste("US total:", names(Corona_Totals.US.raw)[ncol(Corona_Totals.US.raw)])
## [1] "US total: X6.15.20"
paste("World deaths:", names(Corona_Deaths.raw)[ncol(Corona_Deaths.raw)])
## [1] "World deaths: X6.15.20"
paste("World total:", names(Corona_Totals.raw)[ncol(Corona_Totals.raw)])
## [1] "World total: X6.15.20"

PROCESS DATA

  • Convert to long format
  • Fix date formatting/convert to numeric date
  • Log10 transform total # cases
##------------------------------------------
## Combine death and total data frames
##------------------------------------------
Corona_Totals.raw$case_type<-"total"
Corona_Totals.US.raw$case_type<-"total"
Corona_Deaths.raw$case_type<-"death"
Corona_Deaths.US.raw$case_type<-"death"

# For some reason, Population is listed in the US deaths file but not in the other data... weird. When combining, all datasets will have this column, but US deaths is the only one with useful values.
Corona_Totals.US.raw$Population<-"NA" 
Corona_Totals.raw$Population<-"NA"
Corona_Deaths.raw$Population<-"NA"

Corona_Cases.raw<-rbind(Corona_Totals.raw,Corona_Deaths.raw)
Corona_Cases.US.raw<-rbind(Corona_Totals.US.raw,Corona_Deaths.US.raw)
#TODO: custom utils- setdiff, intersect names... option to output in merging too
##------------------------------------------
# prepare raw datasets for eventual combining
##------------------------------------------
Corona_Cases.raw$City<-"NA" # US-level data has Cities
Corona_Cases.US.raw$Country_Region<-"US_state" # To differentiate from World-level stats

Corona_Cases.US.raw<-plyr::rename(Corona_Cases.US.raw,c("Province_State"="Province.State",
                                                  "Country_Region"="Country.Region",
                                                  "Long_"="Long",
                                                  "Admin2"="City"))


##------------------------------------------
## Convert to long format
##------------------------------------------
#JHU uses an awkward file format: it's wide, with each column a date in MM/DD/YY. So read it in as raw data, then transform it into a shape better suited for analysis.
# Furthermore, the World- and US-level data are formatted differently, containing different columns, etc. Rectify this and combine the world-level stats with the U.S.-level stats.

Corona_Cases.long<-rbind(make_long(select(Corona_Cases.US.raw,-c(UID,iso2,iso3,code3,FIPS,Combined_Key))),
make_long(Corona_Cases.raw))


##------------------------------------------
## Fix date formatting, convert to numeric date
##------------------------------------------
Corona_Cases.long$Date<-gsub(Corona_Cases.long$Date,pattern = "^X",replacement = "0") # read.csv prepends "X" to names starting with a digit; swap it for a leading 0
Corona_Cases.long$Date<-gsub(Corona_Cases.long$Date,pattern = "20$",replacement = "2020") # ends in .20 and not 2020
Corona_Cases.long$Date<-as.Date(Corona_Cases.long$Date,format = "%m.%d.%Y") # year is 4-digit after the gsub above, so use %Y
Corona_Cases.long$Date.numeric<-as.numeric(Corona_Cases.long$Date)
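A minimal check of this cleanup on a single sample header (the value is made up, but has the same shape as the JHU column names):

```r
# read.csv prepends "X" to column names that start with a digit,
# so a 6/15/20 column arrives as "X6.15.20"
d <- "X6.15.20"
d <- gsub(d, pattern = "^X", replacement = "0")     # "06.15.20"
d <- gsub(d, pattern = "20$", replacement = "2020") # "06.15.2020"
as.Date(d, format = "%m.%d.%Y")                     # "2020-06-15"
```

Note that the leading-zero trick relies on single-digit months; a double-digit month like "X10.1.20" would be padded to "010.1.20" and fail to parse, so later data would need a different replacement.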

kable(table(select(Corona_Cases.long,c("Country.Region","case_type"))),caption = "Number of death and total case longitudinal datapoints per geographical region")
Number of death and total case longitudinal datapoints per geographical region
Country.Region death total
Afghanistan 146 146
Albania 146 146
Algeria 146 146
Andorra 146 146
Angola 146 146
Antigua and Barbuda 146 146
Argentina 146 146
Armenia 146 146
Australia 1168 1168
Austria 146 146
Azerbaijan 146 146
Bahamas 146 146
Bahrain 146 146
Bangladesh 146 146
Barbados 146 146
Belarus 146 146
Belgium 146 146
Belize 146 146
Benin 146 146
Bhutan 146 146
Bolivia 146 146
Bosnia and Herzegovina 146 146
Botswana 146 146
Brazil 146 146
Brunei 146 146
Bulgaria 146 146
Burkina Faso 146 146
Burma 146 146
Burundi 146 146
Cabo Verde 146 146
Cambodia 146 146
Cameroon 146 146
Canada 2044 2044
Central African Republic 146 146
Chad 146 146
Chile 146 146
China 4818 4818
Colombia 146 146
Comoros 146 146
Congo (Brazzaville) 146 146
Congo (Kinshasa) 146 146
Costa Rica 146 146
Cote d’Ivoire 146 146
Croatia 146 146
Cuba 146 146
Cyprus 146 146
Czechia 146 146
Denmark 438 438
Diamond Princess 146 146
Djibouti 146 146
Dominica 146 146
Dominican Republic 146 146
Ecuador 146 146
Egypt 146 146
El Salvador 146 146
Equatorial Guinea 146 146
Eritrea 146 146
Estonia 146 146
Eswatini 146 146
Ethiopia 146 146
Fiji 146 146
Finland 146 146
France 1606 1606
Gabon 146 146
Gambia 146 146
Georgia 146 146
Germany 146 146
Ghana 146 146
Greece 146 146
Grenada 146 146
Guatemala 146 146
Guinea 146 146
Guinea-Bissau 146 146
Guyana 146 146
Haiti 146 146
Holy See 146 146
Honduras 146 146
Hungary 146 146
Iceland 146 146
India 146 146
Indonesia 146 146
Iran 146 146
Iraq 146 146
Ireland 146 146
Israel 146 146
Italy 146 146
Jamaica 146 146
Japan 146 146
Jordan 146 146
Kazakhstan 146 146
Kenya 146 146
Korea, South 146 146
Kosovo 146 146
Kuwait 146 146
Kyrgyzstan 146 146
Laos 146 146
Latvia 146 146
Lebanon 146 146
Lesotho 146 146
Liberia 146 146
Libya 146 146
Liechtenstein 146 146
Lithuania 146 146
Luxembourg 146 146
Madagascar 146 146
Malawi 146 146
Malaysia 146 146
Maldives 146 146
Mali 146 146
Malta 146 146
Mauritania 146 146
Mauritius 146 146
Mexico 146 146
Moldova 146 146
Monaco 146 146
Mongolia 146 146
Montenegro 146 146
Morocco 146 146
Mozambique 146 146
MS Zaandam 146 146
Namibia 146 146
Nepal 146 146
Netherlands 730 730
New Zealand 146 146
Nicaragua 146 146
Niger 146 146
Nigeria 146 146
North Macedonia 146 146
Norway 146 146
Oman 146 146
Pakistan 146 146
Panama 146 146
Papua New Guinea 146 146
Paraguay 146 146
Peru 146 146
Philippines 146 146
Poland 146 146
Portugal 146 146
Qatar 146 146
Romania 146 146
Russia 146 146
Rwanda 146 146
Saint Kitts and Nevis 146 146
Saint Lucia 146 146
Saint Vincent and the Grenadines 146 146
San Marino 146 146
Sao Tome and Principe 146 146
Saudi Arabia 146 146
Senegal 146 146
Serbia 146 146
Seychelles 146 146
Sierra Leone 146 146
Singapore 146 146
Slovakia 146 146
Slovenia 146 146
Somalia 146 146
South Africa 146 146
South Sudan 146 146
Spain 146 146
Sri Lanka 146 146
Sudan 146 146
Suriname 146 146
Sweden 146 146
Switzerland 146 146
Syria 146 146
Taiwan* 146 146
Tajikistan 146 146
Tanzania 146 146
Thailand 146 146
Timor-Leste 146 146
Togo 146 146
Trinidad and Tobago 146 146
Tunisia 146 146
Turkey 146 146
Uganda 146 146
Ukraine 146 146
United Arab Emirates 146 146
United Kingdom 1606 1606
Uruguay 146 146
US 146 146
US_state 476106 476106
Uzbekistan 146 146
Venezuela 146 146
Vietnam 146 146
West Bank and Gaza 146 146
Western Sahara 146 146
Yemen 146 146
Zambia 146 146
Zimbabwe 146 146
# Decouple population and lat/long data, refactor to make it more tidy
metadata_columns<-c("Lat","Long","Population")
metadata<-unique(select(filter(Corona_Cases.long,case_type=="death"),c("Country.Region","Province.State","City",all_of(metadata_columns))))
Corona_Cases.long<-select(Corona_Cases.long,-all_of(metadata_columns))

# Some counties are not summarized on the country level. collapse all but US
Corona_Cases.long<-rbind.fill(ddply(filter(Corona_Cases.long,!Country.Region=="US_state"),c("case_type","Country.Region","Date","Date.numeric"),summarise,Total_confirmed_cases=sum(Total_confirmed_cases)),filter(Corona_Cases.long,Country.Region=="US_state"))

# Put total case and deaths side-by-side (wide)
Corona_Cases<-spread(Corona_Cases.long,key = case_type,value = Total_confirmed_cases)

#Compute mortality rate
Corona_Cases$mortality_rate<-Corona_Cases$death/Corona_Cases$total

#TMP
Corona_Cases<-plyr::rename(Corona_Cases,c("total"="Total_confirmed_cases","death"="Total_confirmed_deaths"))

##------------------------------------------
## log10 transform total # cases
##------------------------------------------
Corona_Cases$Total_confirmed_cases.log<-log(Corona_Cases$Total_confirmed_cases,10)
Corona_Cases$Total_confirmed_deaths.log<-log(Corona_Cases$Total_confirmed_deaths,10)
##------------------------------------------
       
##------------------------------------------
## Compute # of days since 100th for US data
##------------------------------------------

# Find the day the 100th case occurred for each Country/Province. NOTE: non-US countries may have odd province breakdowns. For example, France is summarized at the country level but also has three provinces. I've only verified that case100 works for the U.S., so the case100_date for the U.S. is summarized both for the entire country (regardless of state) and on a per-state level.
# TODO: consider a city-level summary as well. This data may be sparse

Corona_Cases<-merge(Corona_Cases,ddply(filter(Corona_Cases,Total_confirmed_cases>100),c("Country.Region"),summarise,case100_date=min(Date.numeric)))
Corona_Cases$Days_since_100<-Corona_Cases$Date.numeric-Corona_Cases$case100_date
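The first-crossing logic can be sketched on toy data in base R (hypothetical regions and counts; the real pipeline does the same thing with ddply and merge on the full frame):

```r
# Toy sketch: find the day each region first exceeds 100 cases,
# then express every observation relative to that day
toy <- data.frame(
  region = rep(c("A", "B"), each = 4),
  day    = rep(1:4, times = 2),
  cases  = c(50, 120, 300, 800, 10, 40, 90, 150)
)
first100 <- aggregate(day ~ region, data = subset(toy, cases > 100), FUN = min)
names(first100)[2] <- "case100_day"
toy <- merge(toy, first100)            # A crosses on day 2, B on day 4
toy$days_since_100 <- toy$day - toy$case100_day
```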

##------------------------------------------
## Add population and lat/long data (CURRENTLY US ONLY)
##------------------------------------------

kable(filter(metadata,(is.na(Country.Region) | is.na(Population) )) %>% select(c("Country.Region","Province.State","City")) %>% unique(),caption = "Regions for which either population or Country is NA")
Regions for which either population or Country is NA
Country.Region Province.State City
# Drop missing data 
metadata<-filter(metadata,!(is.na(Country.Region) | is.na(Population) ))
# Convert remaining pop to numeric
metadata$Population<-as.numeric(metadata$Population)
## Warning: NAs introduced by coercion
# Add metadata to cases
Corona_Cases<-merge(Corona_Cases,metadata,all.x = T)

##------------------------------------------
## Compute total and death cases relative to population 
##------------------------------------------

Corona_Cases$Total_confirmed_cases.per100<-100*Corona_Cases$Total_confirmed_cases/Corona_Cases$Population
Corona_Cases$Total_confirmed_deaths.per100<-100*Corona_Cases$Total_confirmed_deaths/Corona_Cases$Population


##------------------------------------------
## Filter df for US state-wide stats
##------------------------------------------

Corona_Cases.US_state<-filter(Corona_Cases,Country.Region=="US_state" & Total_confirmed_cases>0 )
Corona_Cases.US_state[!is.na(Corona_Cases.US_state$Province.State) & Corona_Cases.US_state$Province.State=="Pennsylvania" & Corona_Cases.US_state$City=="Delaware","City"]<-"Delaware_PA"

kable(table(select(Corona_Cases.US_state,c("Province.State"))),caption = "Number of longitudinal datapoints (total/death) per state")
Number of longitudinal datapoints (total/death) per state
Province.State Freq
Alabama 5621
Alaska 1126
Arizona 1376
Arkansas 5955
California 5244
Colorado 5087
Connecticut 817
Delaware 343
Diamond Princess 91
District of Columbia 92
Florida 5926
Georgia 13391
Grand Princess 92
Guam 92
Hawaii 472
Idaho 2724
Illinois 7866
Indiana 7757
Iowa 7380
Kansas 6240
Kentucky 8866
Louisiana 5580
Maine 1423
Maryland 2153
Massachusetts 1366
Michigan 6906
Minnesota 6565
Mississippi 6925
Missouri 7925
Montana 2438
Nebraska 4799
Nevada 1101
New Hampshire 972
New Jersey 2059
New Mexico 2420
New York 5251
North Carolina 8236
North Dakota 2973
Northern Mariana Islands 77
Ohio 7325
Oklahoma 5674
Oregon 2821
Pennsylvania 5740
Puerto Rico 92
Rhode Island 544
South Carolina 4019
South Dakota 3767
Tennessee 7950
Texas 17232
Utah 1253
Vermont 1305
Virgin Islands 92
Virginia 10452
Washington 3580
West Virginia 3915
Wisconsin 5642
Wyoming 1757
Corona_Cases.US_state<-merge(Corona_Cases.US_state,ddply(filter(Corona_Cases.US_state,Total_confirmed_cases>100),c("Province.State"),summarise,case100_date_state=min(Date.numeric)))
Corona_Cases.US_state$Days_since_100_state<-Corona_Cases.US_state$Date.numeric-Corona_Cases.US_state$case100_date_state

ANALYSIS

Q1: What is the trend in cases and mortality across geographical regions?

Plot # of cases vs time. For each geographical set:
  * comparative longitudinal case trend (absolute & log scale)
  * comparative longitudinal mortality trend
  * death vs total correlation

question | dataset | x | y | color | facet | pch | dimensions
comparative longitudinal case trend (log) | long | time | log_cases | geography | none (case type?) | case_type | [15, 50, 4] geography x (2 scale?) case type
comparative longitudinal case trend | long | time | cases | geography | case_type | ? | [15, 50, 4] geography x (2+ scale) case type
comparative longitudinal mortality trend | wide | time | mortality rate | geography | none | none | [15, 50, 4] geography
death vs total correlation | wide | cases | deaths | geography | none | none | [15, 50, 4] geography
# total cases vs time
# death cases vs time
# mortality rate vs time
# death vs mortality


  # death vs mortality
  # total & death case vs time (same plot)

#<question> <x> <y> <colored> <facet> <dataset>
## trend in cases/deaths over time, compared across regions <time> <log cases> <geography*> <none> <.wide>
## trend in cases/deaths over time, compared across regions <time> <cases> <geography*> <case_type> <.long>
## trend in mortality rate over time, compared across regions <time> <mortality rate> <geography*> <none>
## how are death/mortality related/correlated? <time> <log cases> <geography*> <none>
## how are death and case load correlated? <cases> <deaths>

# lm for each? -> fit an lm for each region starting from the 100th case; m, b associated with each.
    # input: geographical region, log cases vs day (since 100th case)
    # output: m, b for each geographical region ID



#total/death on same plot - they differ by ~2 logs, so when plotting log scale use pch; when plotting absolute counts, use free scales
#when plotting death and case on the same plot, melt.

#CoronaCases - > filter sets (3)
  #world - choose countries with sufficient data

N<-ddply(filter(Corona_Cases,Total_confirmed_cases>100),c("Country.Region"),summarise,n=length(Country.Region))
ggplot(filter(N,n<100),aes(x=n))+
  geom_histogram()+
  default_theme+
  ggtitle("Distribution of number of days with at least 100 confirmed cases for each region")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

kable(arrange(N,-n),caption="Sorted number of days with at least 100 confirmed cases")
Sorted number of days with at least 100 confirmed cases
Country.Region n
US_state 61332
China 146
Diamond Princess 127
Korea, South 117
Japan 116
Italy 114
Iran 111
Singapore 108
France 107
Germany 107
Spain 106
US 104
Switzerland 103
United Kingdom 103
Belgium 102
Netherlands 102
Norway 102
Sweden 102
Austria 100
Malaysia 99
Australia 98
Bahrain 98
Denmark 98
Canada 97
Qatar 97
Iceland 96
Brazil 95
Czechia 95
Finland 95
Greece 95
Iraq 95
Israel 95
Portugal 95
Slovenia 95
Egypt 94
Estonia 94
India 94
Ireland 94
Kuwait 94
Philippines 94
Poland 94
Romania 94
Saudi Arabia 94
Indonesia 93
Lebanon 93
Pakistan 93
San Marino 93
Thailand 93
Chile 92
Luxembourg 91
Peru 91
Russia 91
Ecuador 90
Mexico 90
Slovakia 90
South Africa 90
United Arab Emirates 90
Armenia 89
Colombia 89
Croatia 89
Panama 89
Serbia 89
Taiwan* 89
Turkey 89
Argentina 88
Bulgaria 88
Latvia 88
Uruguay 88
Algeria 87
Costa Rica 87
Dominican Republic 87
Hungary 87
Andorra 86
Bosnia and Herzegovina 86
Jordan 86
Lithuania 86
Morocco 86
New Zealand 86
North Macedonia 86
Vietnam 86
Albania 85
Cyprus 85
Malta 85
Moldova 85
Brunei 84
Burkina Faso 84
Sri Lanka 84
Tunisia 84
Ukraine 83
Azerbaijan 82
Ghana 82
Kazakhstan 82
Oman 82
Senegal 82
Venezuela 82
Afghanistan 81
Cote d’Ivoire 81
Cuba 80
Mauritius 80
Uzbekistan 80
Cambodia 79
Cameroon 79
Honduras 79
Nigeria 79
West Bank and Gaza 79
Belarus 78
Georgia 78
Bolivia 77
Kosovo 77
Kyrgyzstan 77
Montenegro 77
Congo (Kinshasa) 76
Kenya 75
Niger 74
Guinea 73
Rwanda 73
Trinidad and Tobago 73
Paraguay 72
Bangladesh 71
Djibouti 69
El Salvador 68
Guatemala 67
Madagascar 66
Mali 65
Congo (Brazzaville) 62
Jamaica 62
Gabon 60
Somalia 60
Tanzania 60
Ethiopia 59
Burma 58
Sudan 57
Liberia 56
Maldives 54
Equatorial Guinea 53
Cabo Verde 51
Sierra Leone 49
Guinea-Bissau 48
Togo 48
Zambia 47
Eswatini 46
Chad 45
Tajikistan 44
Haiti 42
Sao Tome and Principe 42
Benin 40
Nepal 40
Uganda 40
Central African Republic 39
South Sudan 39
Guyana 37
Mozambique 36
Yemen 32
Mongolia 31
Mauritania 28
Nicaragua 28
Malawi 22
Syria 22
Zimbabwe 20
Bahamas 19
Libya 19
Comoros 17
Suriname 9
Angola 6
Eritrea 1
# Pick top 15 countries with data
max_colors<-12
# find a way to fix this - China has different provinces, so the plot doesn't look right...
sufficient_data<-arrange(filter(N,!Country.Region %in% c("US_state", "Diamond Princess")),-n)[1:max_colors,]
kable(sufficient_data,caption = paste0("Top ",max_colors," countries with sufficient data"))
Top 12 countries with sufficient data
Country.Region n
China 146
Korea, South 117
Japan 116
Italy 114
Iran 111
Singapore 108
France 107
Germany 107
Spain 106
US 104
Switzerland 103
United Kingdom 103
Corona_Cases.world<-filter(Corona_Cases,Country.Region %in% c(sufficient_data$Country.Region))


  #us 
  #    - by state
Corona_Cases.US<-filter(Corona_Cases,Country.Region=="US" & Total_confirmed_cases>0)
# summarize 
#!City %in% c("Unassigned") 
  #    - specific cities
#mortality_rate!=Inf & mortality_rate<=1
head(Corona_Cases)
##   Country.Region Province.State City       Date Date.numeric
## 1    Afghanistan           <NA> <NA> 2020-03-11        18332
## 2    Afghanistan           <NA> <NA> 2020-04-20        18372
## 3    Afghanistan           <NA> <NA> 2020-04-18        18370
## 4    Afghanistan           <NA> <NA> 2020-04-08        18360
## 5    Afghanistan           <NA> <NA> 2020-04-22        18374
## 6    Afghanistan           <NA> <NA> 2020-04-10        18362
##   Total_confirmed_deaths Total_confirmed_cases mortality_rate
## 1                      0                     7     0.00000000
## 2                     36                  1026     0.03508772
## 3                     30                   933     0.03215434
## 4                     14                   444     0.03153153
## 5                     40                  1176     0.03401361
## 6                     15                   521     0.02879079
##   Total_confirmed_cases.log Total_confirmed_deaths.log case100_date
## 1                  0.845098                       -Inf        18348
## 2                  3.011147                   1.556303        18348
## 3                  2.969882                   1.477121        18348
## 4                  2.647383                   1.146128        18348
## 5                  3.070407                   1.602060        18348
## 6                  2.716838                   1.176091        18348
##   Days_since_100 Lat Long Population Total_confirmed_cases.per100
## 1            -16  NA   NA         NA                           NA
## 2             24  NA   NA         NA                           NA
## 3             22  NA   NA         NA                           NA
## 4             12  NA   NA         NA                           NA
## 5             26  NA   NA         NA                           NA
## 6             14  NA   NA         NA                           NA
##   Total_confirmed_deaths.per100
## 1                            NA
## 2                            NA
## 3                            NA
## 4                            NA
## 5                            NA
## 6                            NA
Corona_Cases[!is.na(Corona_Cases$Province.State) & Corona_Cases$Province.State=="Pennsylvania" & Corona_Cases$City=="Delaware","City"]<-"Delaware_PA"


Corona_Cases.UScity<-filter(Corona_Cases,Province.State %in% c("Pennsylvania","Maryland","New York","New Jersey") & City %in% c("Bucks","Baltimore City", "New York","Burlington","Cape May","Delaware_PA"))


measure_vars_long<-c("Total_confirmed_cases.log","Total_confirmed_cases","Total_confirmed_deaths","Total_confirmed_deaths.log")
melt_arg_list<-list(variable.name = "case_type",value.name = "cases",measure.vars = c("Total_confirmed_cases","Total_confirmed_deaths"))
melt_arg_list$data=NULL


melt_arg_list$data=select(Corona_Cases.world,-ends_with(match = "log"))
Corona_Cases.world.long<-do.call(melt,melt_arg_list)
melt_arg_list$data=select(Corona_Cases.UScity,-ends_with(match = "log"))
Corona_Cases.UScity.long<-do.call(melt,melt_arg_list)
melt_arg_list$data=select(Corona_Cases.US_state,-ends_with(match = "log"))
Corona_Cases.US_state.long<-do.call(melt,melt_arg_list)

Corona_Cases.world.long$cases.log<-log(Corona_Cases.world.long$cases,10)
Corona_Cases.US_state.long$cases.log<-log(Corona_Cases.US_state.long$cases,10)
Corona_Cases.UScity.long$cases.log<-log(Corona_Cases.UScity.long$cases,10)


# what is the current death and total case load for US? For world? For states?
#-absolute
#-log

# what is mortality rate (US, world)
#-absolute

#how is death and case correlated? (US, world)
#-absolute
#Corona_Cases.US<-filter(Corona_Cases,Country.Region=="US" & Total_confirmed_cases>0)
#Corona_Cases.US.case100<-filter(Corona_Cases.US, Days_since_100>=0)
# linear model parameters
#(model_fit<-lm(formula = Total_confirmed_cases.log~Days_since_100,data= Corona_Cases.US.case100 ))

#(slope<-model_fit$coefficients[2])
#(intercept<-model_fit$coefficients[1])

# Correlation coefficient
#cor(x = Corona_Cases.US.case100$Days_since_100,y = Corona_Cases.US.case100$Total_confirmed_cases.log)

##------------------------------------------
## Plot World Data
##------------------------------------------
# Timestamp for world
timestamp_plot.world<-paste("Most recent date for which data available:",max(Corona_Cases.world$Date))#timestamp(quiet = T,prefix = "Updated ",suffix = " (EST)")


# Base template for plots
baseplot.world<-ggplot(data=NULL,aes(x=Days_since_100,col=Country.Region))+
  default_theme+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  ggtitle(paste("Log10 cases over time,",timestamp_plot.world))+
  theme(legend.position = "bottom",plot.title = element_text(size=12))


##/////////////////////////
### Plot Longitudinal cases

(Corona_Cases.world.long.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world.long,aes(y=cases))+
    geom_line(data=Corona_Cases.world.long,aes(y=cases))+
    facet_wrap(~case_type,scales = "free_y",ncol=1)+
    ggtitle(timestamp_plot.world)
    )

(Corona_Cases.world.loglong.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world.long,aes(y=cases.log))+
    geom_line(data=Corona_Cases.world.long,aes(y=cases.log))+
    facet_wrap(~case_type,scales = "free_y",ncol=1)+
    ggtitle(timestamp_plot.world))

##/////////////////////////
### Plot Longitudinal mortality rate

(Corona_Cases.world.mortality.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world,aes(y=mortality_rate))+
    geom_line(data=Corona_Cases.world,aes(y=mortality_rate))+
    ylim(c(0,0.3))+
    ggtitle(timestamp_plot.world))
## Warning: Removed 100 rows containing missing values (geom_point).
## Warning: Removed 100 row(s) containing missing values (geom_path).

##/////////////////////////
### Plot death vs total case correlation

(Corona_Cases.world.casecor.plot<-ggplot(Corona_Cases.world,aes(x=Total_confirmed_cases,y=Total_confirmed_deaths,col=Country.Region))+
  geom_point()+
  geom_line()+
  default_theme+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  ggtitle(timestamp_plot.world)+
  theme(legend.position = "bottom",plot.title = element_text(size=12)))

### Write plots

write_plot(Corona_Cases.world.long.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.long.plot.png"
write_plot(Corona_Cases.world.loglong.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.loglong.plot.png"
write_plot(Corona_Cases.world.mortality.plot,wd = results_dir)
## Warning: Removed 100 rows containing missing values (geom_point).

## Warning: Removed 100 row(s) containing missing values (geom_path).
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.mortality.plot.png"
write_plot(Corona_Cases.world.casecor.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.casecor.plot.png"
##------------------------------------------
## Plot US State Data
##-----------------------------------------

baseplot.US<-ggplot(data=NULL,aes(x=Days_since_100_state,col=case_type))+
  default_theme+
  facet_wrap(~Province.State)+
  ggtitle(paste("Log10 cases over time,",timestamp_plot.world))

Corona_Cases.US_state.long.plot<-baseplot.US+geom_point(data=Corona_Cases.US_state.long,aes(y=cases.log))
##------------------------------------------
## Plot US City Data
##-----------------------------------------

Corona_Cases.US.plotdata<-filter(Corona_Cases.US_state,Province.State %in% c("Pennsylvania","Maryland","New York","New Jersey") &
                                   City %in% c("Bucks","Baltimore City", "New York","Burlington","Cape May","Delaware_PA") &
                                   Total_confirmed_cases>0) 
timestamp_plot<-paste("Most recent date for which data available:",max(Corona_Cases.US.plotdata$Date))#timestamp(quiet = T,prefix = "Updated ",suffix = " (EST)")

city_colors<-c("Bucks"='#beaed4',"Baltimore City"='#386cb0', "New York"='#7fc97f',"Burlington"='#fdc086',"Cape May"="#e78ac3","Delaware_PA"="#b15928")

##/////////////////////////
### Plot death vs total case correlation

(Corona_Cases.city.loglong.plot<-ggplot(melt(Corona_Cases.US.plotdata,measure.vars = c("Total_confirmed_cases.log","Total_confirmed_deaths.log"),variable.name = "case_type",value.name = "cases"),aes(x=Date,y=cases,col=City,pch=case_type))+
  geom_point(size=4)+
    geom_line()+
  default_theme+
  #facet_wrap(~case_type)+
    ggtitle(paste("Log10 total and death cases over time,",timestamp_plot))+
theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
    scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.long.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(x=Date,y=Total_confirmed_cases,col=City))+
  geom_point(size=4)+
  geom_line()+
  default_theme+
  facet_grid(~Province.State,scales = "free_y")+
  ggtitle(paste("MD, PA, NJ total cases over time,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.mortality.plot<-ggplot(Corona_Cases.US.plotdata,aes(x=Date,y=mortality_rate,col=City))+
  geom_point(size=3)+
  geom_line(size=2)+
  default_theme+
  ggtitle(paste("Mortality rate (deaths/total) over time,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.casecor.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(y=Total_confirmed_deaths,x=Total_confirmed_cases,col=City))+
  geom_point(size=3)+
  geom_line(size=2)+
  default_theme+
  ggtitle(paste("Correlation of death vs total cases,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12))+
  scale_color_manual(values = city_colors))

(Corona_Cases.city.long.normalized.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(x=Date,y=Total_confirmed_cases.per100,col=City))+
  geom_point(size=4)+
  geom_line()+
  default_theme+
  facet_grid(~Province.State)+
  ggtitle(paste("MD, PA, NJ total cases over time per 100 people,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)  +
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

write_plot(Corona_Cases.city.long.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.long.plot.png"
write_plot(Corona_Cases.city.loglong.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.loglong.plot.png"
write_plot(Corona_Cases.city.mortality.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.mortality.plot.png"
write_plot(Corona_Cases.city.casecor.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.casecor.plot.png"
write_plot(Corona_Cases.city.long.normalized.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.long.normalized.plot.png"

Q1b: What is the model?

Fit the cases to a linear model:

1. Find the time at which cases vs date becomes linear in each plot
2. Fit a linear model for each city

# What is the predicted # of cases for the next few days?
# How is the model performing historically?

Corona_Cases.US_state.summary<-ddply(Corona_Cases.US_state,
                                     c("Province.State","Date"),
                                     summarise,
                                     Total_confirmed_cases_perstate=sum(Total_confirmed_cases)) %>% 
    filter(Total_confirmed_cases_perstate>100)

# Compute the states with the most cases (for coloring and for linear model)
top_states_totals<-head(ddply(Corona_Cases.US_state.summary,c("Province.State"),summarise, Total_confirmed_cases_perstate.max=max(Total_confirmed_cases_perstate)) %>% arrange(-Total_confirmed_cases_perstate.max),n=max_colors)

kable(top_states_totals,caption = "Top 12 States, total count ")
top_states<-top_states_totals$Province.State

# Manually adjust the top states: switch out New York for Maryland
top_states_modified<-c(top_states[top_states !="New York"],"Maryland")

# Plot with all states:
(Corona_Cases.US_state.summary.plot<-ggplot(Corona_Cases.US_state.summary,aes(x=Date,y=Total_confirmed_cases_perstate))+
  geom_point()+
  geom_point(data=filter(Corona_Cases.US_state.summary,Province.State %in% top_states),aes(col=Province.State))+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  default_theme+
  theme(axis.text.x = element_text(angle=45,hjust=1),legend.position = "bottom")+
  ggtitle("Total confirmed cases per state, top 12 colored")+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

##------------------------------------------
## Fit linear model to time vs total cases
##-----------------------------------------

# First, find the date at which each state's cases vs time becomes linear (2nd derivative is about 0)
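The helper find_linear_index() is defined elsewhere in this project and is not shown here. One plausible base-R sketch matching the comment above (return the first date at which the discrete second derivative of cases is near zero); the tolerance tol and the relative-curvature criterion are assumptions:

```r
# Hypothetical sketch of find_linear_index(); the real definition lives
# upstream. df is one state's rows of Corona_Cases.US_state.summary.
find_linear_index <- function(df, tol = 0.05) {
  df <- df[order(df$Date), ]
  y  <- df$Total_confirmed_cases_perstate
  d2 <- diff(y, differences = 2)          # discrete 2nd derivative
  rel <- abs(d2) / pmax(y[-c(1, 2)], 1)   # curvature relative to case load
  idx <- which(rel < tol)[1]              # first near-linear day
  if (is.na(idx)) idx <- 1
  as.numeric(df$Date[idx + 2])            # numeric date, as consumed via li$V1
}
```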
li<-ddply(Corona_Cases.US_state.summary,c("Province.State"),find_linear_index)

# Compute linear model for each state starting at the point at which data becomes linear
for(i in 1:nrow(li)){
  Province.State.i<-li[i,"Province.State"]
  date.i<-li[i,"V1"]
  data.i<-filter(Corona_Cases.US_state.summary,Province.State==Province.State.i & as.numeric(Date) >= date.i)
  model_results<-lm(data.i,formula = Total_confirmed_cases_perstate~Date)
  slope<-model_results$coefficients[2]
  intercept<-model_results$coefficients[1]
  li[li$Province.State==Province.State.i,"m"]<-slope
  li[li$Province.State==Province.State.i,"b"]<-intercept
  }

# Compute top state case load with fitted model

(Corona_Cases.US_state.lm.plot<-ggplot(filter(Corona_Cases.US_state.summary,Province.State %in% top_states_modified ))+
    geom_abline(data=filter(li,Province.State %in% top_states_modified),
                aes(slope = m,intercept = b,col=Province.State),lty=2)+
    geom_point(aes(x=Date,y=Total_confirmed_cases_perstate,col=Province.State))+
    scale_color_brewer(type = "qualitative",palette = "Paired")+
    default_theme+
    theme(axis.text.x = element_text(angle=45,hjust=1),legend.position = "bottom")+
    ggtitle("Total confirmed cases per state, top 12 colored")+
    scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

##------------------------------------------
## Predict the number of total cases over the next week
##-----------------------------------------
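The loop below relies on prediction_model(), a helper defined upstream and not shown in this document. A minimal sketch consistent with its usage here, extrapolating the fitted line m*day + b; the output column name is an assumption based on the rbind() target:

```r
# Hypothetical sketch of prediction_model(); the real helper lives upstream.
# Extrapolates the per-state linear fit to the requested (numeric) days.
prediction_model <- function(m, b, days) {
  data.frame(days = days,
             Total_confirmed_cases_perstate = m * days + b)
}
```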

predicted_days<-c(0,1,2,3,7)+as.numeric(as.Date("2020-04-20"))

predicted_days_df<-data.frame(matrix(ncol=3))
names(predicted_days_df)<-c("Province.State","days","Total_confirmed_cases_perstate")

# Use model parameters to estimate case loads
for(state.i in top_states_modified){
  predicted_days_df<-rbind(predicted_days_df,
                           data.frame(Province.State=state.i,
                                      prediction_model(m = li[li$Province.State==state.i,"m"],
                                                       b =li[li$Province.State==state.i,"b"] ,
                                                       days =predicted_days )))
  }

predicted_days_df$Date<-as.Date(predicted_days_df$days,origin="1970-01-01")

kable(predicted_days_df,caption = "Predicted total cases over the next week for selected states")

##------------------------------------------
## Write plots
##-----------------------------------------

write_plot(Corona_Cases.US_state.summary.plot,wd = results_dir)
write_plot(Corona_Cases.US_state.lm.plot,wd = results_dir)

##------------------------------------------
## Write tables
##-----------------------------------------

write.csv(predicted_days_df,file = paste0(results_dir,"predicted_total_cases_days.csv"),quote = F,row.names = F)

Q2: What is the predicted number of cases?

What is the prediction of COVID-19 based on the model thus far? Additional questions:

Why did it take until day 40 for a log-linear trend to start? How long will it be until x number of cases? When will the plateau happen? Are any effects of social distancing noticeable, and with what delay?
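As a worked example for the "how long until x cases" question: under a log10-linear model log10(cases) = m*day + b, the target day can be solved in closed form. The m and b below are illustrative values, not fitted ones:

```r
# Solve log10(target) = m*day + b for day
days_to_reach <- function(target_cases, m, b) {
  (log10(target_cases) - b) / m
}

# e.g. with slope 0.05 log10-units/day and intercept 2 (100 cases at day 0),
# one million cases is reached around day (6 - 2) / 0.05 = 80
days_to_reach(1e6, m = 0.05, b = 2)
```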

##------------------------------------------
## Prediction and Prediction Accuracy
##------------------------------------------


today_num<-max(Corona_Cases.US$Days_since_100)
predicted_days<-today_num+c(1,2,3,7)

#mods = dlply(mydf, .(x3), lm, formula = y ~ x1 + x2)
#today:
Corona_Cases.US[Corona_Cases.US$Days_since_100==(today_num-1),]
Corona_Cases.US[Corona_Cases.US$Days_since_100==today_num,]
Corona_Cases.US$type<-"Historical"


#prediction_values<-prediction_model(m=slope,b=intercept,days = predicted_days)$Total_confirmed_cases

historical_model<-data.frame(date=today_num,m=slope,b=intercept)

# Toy data for testing per-group model fitting: state A is linear with
# slope 1, state B with slope 5
tmp<-data.frame(state=rep(c("A","B"),each=3),x=c(1,2,3,4,5,6))
tmp$y<-c(tmp[1:3,"x"]+5,tmp[4:6,"x"]*5+1)
lm(data =tmp,formula = y~x )

train_lm<-function(input_data,subset_column,formula_input){
  # Fit one lm per group, then collect the coefficients into a data frame
  case_models <- dlply(input_data, subset_column, lm, formula = formula_input)
  case_models.parameters <- ldply(case_models, coef)
  # plyr::rename() expects c(old = new); the slope column is named after the
  # predictor variable in the formula
  predictor <- all.vars(formula_input)[2]
  case_models.parameters <- rename(case_models.parameters,
                                   setNames(c("b","m"),c("(Intercept)",predictor)))
  return(case_models.parameters)
}

train_lm(tmp, "state", y~x)
 
# model for previous y days
#historical_model_predictions<-data.frame(day_x=NULL,Days_since_100=NULL,Total_confirmed_cases=NULL,Total_confirmed_cases.log=NULL)
# for(i in c(1,2,3,4,5,6,7,8,9,10)){
#   #i<-1
# day_x<-today_num-i # 1, 2, 3, 4
# day_x_nextweek<-day_x+c(1,2,3)
# model_fit_x<-lm(data = filter(Corona_Cases.US.case100,Days_since_100 < day_x),formula = Total_confirmed_cases.log~Days_since_100)
# prediction_day_x_nextweek<-prediction_model(m = model_fit_x$coefficients[2],b = model_fit_x$coefficients[1],days = day_x_nextweek)
# prediction_day_x_nextweek$type<-"Predicted"
# actual_day_x_nextweek<-filter(Corona_Cases.US,Days_since_100 %in% day_x_nextweek) %>% select(c(Days_since_100,Total_confirmed_cases,Total_confirmed_cases.log))
# actual_day_x_nextweek$type<-"Historical"
# historical_model_predictions.i<-data.frame(day_x=day_x,rbind(actual_day_x_nextweek,prediction_day_x_nextweek))
# historical_model_predictions<-rbind(historical_model_predictions.i,historical_model_predictions)
# }

#historical_model_predictions.withHx<-rbind.fill(historical_model_predictions,data.frame(Corona_Cases.US,type="Historical"))
#historical_model_predictions.withHx$Total_confirmed_cases.log2<-log(historical_model_predictions.withHx$Total_confirmed_cases,2)

(historical_model_predictions.plot<-ggplot(historical_model_predictions.withHx,aes(x=Days_since_100,y=Total_confirmed_cases.log,col=type))+
    geom_point(size=3)+
    default_theme+
    theme(legend.position = "bottom")+ 
      #geom_abline(slope = slope,intercept =intercept,lty=2)+
    #facet_wrap(~case_type,ncol=1)+
    scale_color_manual(values = c("Historical"="#377eb8","Predicted"="#e41a1c")))
write_plot(historical_model_predictions.plot,wd=results_dir)

Q3: What is the effect of social distancing and decreased mobility on case load?

Load data from Google, which computes the % change in user mobility relative to baseline for:

* Recreation
* Workplace
* Residence
* Park
* Grocery

Data from https://www.google.com/covid19/mobility/

# See pre-processing section for script on gathering mobility data

# UNDER DEVELOPMENT

mobility<-read.csv("/Users/stevensmith/Projects/MIT_COVID19/mobility.csv",header = T,stringsAsFactors = F)
#mobility$Retail_Recreation<-as.numeric(sub(mobility$Retail_Recreation,pattern = "%",replacement = ""))
#mobility$Workplace<-as.numeric(sub(mobility$Workplace,pattern = "%",replacement = ""))
#mobility$Residential<-as.numeric(sub(mobility$Residential,pattern = "%",replacement = ""))

##------------------------------------------
## Show relationship between mobility and caseload
##------------------------------------------
mobility$County<-gsub(mobility$County,pattern = " County",replacement = "")
Corona_Cases.US_state.mobility<-merge(Corona_Cases.US_state,plyr::rename(mobility,c("State"="Province.State","County"="City")))

#Corona_Cases.US_state.tmp<-merge(metadata,Corona_Cases.US_state.tmp)
# Needs to happen upstream, see todos
#Corona_Cases.US_state.tmp$Total_confirmed_cases.perperson<-Corona_Cases.US_state.tmp$Total_confirmed_cases/as.numeric(Corona_Cases.US_state.tmp$Population)
mobility_measures<-c("Retail_Recreation","Grocery_Pharmacy","Parks","Transit","Workplace","Residential")

plot_data<-filter(Corona_Cases.US_state.mobility, Date.numeric==max(Corona_Cases.US_state$Date.numeric) ) %>% melt(measure.vars=mobility_measures) 
plot_data$value<-as.numeric(gsub(plot_data$value,pattern = "%",replacement = ""))
plot_data<-filter(plot_data,!is.na(value))

(mobility.plot<-ggplot(filter(plot_data,Province.State %in% c("Pennsylvania","Maryland","New Jersey","California","Delaware","Connecticut")),aes(y=Total_confirmed_cases.per100,x=value))+geom_point()+
  facet_grid(Province.State~variable,scales = "free")+
  xlab("Mobility change from baseline (%)")+
  ylab(paste0("Confirmed cases per 100 people (Today)"))+
  default_theme+
  ggtitle("Mobility change vs cases"))

(mobility.global.plot<-ggplot(plot_data,aes(y=Total_confirmed_cases.per100,x=value))+geom_point()+
  facet_wrap(~variable,scales = "free")+
  xlab("Mobility change from baseline (%)")+
  ylab(paste0("Confirmed cases (Today) per 100 people"))+
  default_theme+
  ggtitle("Mobility change vs cases"))

plot_data.permobility_summary<-ddply(plot_data,c("Province.State","variable"),summarise,cor=cor(y =Total_confirmed_cases.per100,x=value),median_change=median(x=value)) %>% arrange(-abs(cor))

kable(plot_data.permobility_summary,caption = "Ranked per-state mobility correlation with total confirmed cases")
Ranked per-state mobility correlation with total confirmed cases
Province.State variable cor median_change
Alaska Transit 1.0000000 -63.0
Delaware Retail_Recreation 1.0000000 -39.5
Delaware Grocery_Pharmacy 1.0000000 -17.5
Delaware Parks -1.0000000 20.5
Delaware Transit 1.0000000 -37.0
Delaware Workplace 1.0000000 -37.0
Delaware Residential -1.0000000 14.0
Hawaii Grocery_Pharmacy 0.9941206 -34.0
Hawaii Retail_Recreation 0.9715238 -56.0
New Hampshire Parks 0.9512690 -20.0
Vermont Parks 0.9344689 -35.5
Maine Transit -0.9163971 -50.0
Hawaii Parks 0.8937005 -72.0
Connecticut Grocery_Pharmacy -0.8881197 -6.0
Utah Residential -0.8781200 12.0
Hawaii Transit 0.8537962 -89.0
Utah Transit -0.8417294 -18.0
South Dakota Parks 0.8104721 -26.0
Arizona Grocery_Pharmacy -0.7766590 -15.0
Wyoming Parks -0.7738456 -4.0
Rhode Island Workplace -0.7677330 -39.5
Connecticut Transit -0.7592657 -50.0
Alaska Workplace -0.7560343 -33.0
Alaska Grocery_Pharmacy -0.7294322 -7.0
Arizona Residential 0.6758543 13.0
New York Workplace -0.6526389 -34.5
Vermont Grocery_Pharmacy -0.6525202 -25.0
Nevada Transit -0.6434963 -20.0
Alaska Residential 0.6405362 13.0
North Dakota Parks 0.6349369 -34.0
Rhode Island Retail_Recreation -0.6311475 -45.0
Utah Parks -0.6206828 17.0
New Jersey Parks -0.6042205 -6.0
Rhode Island Residential -0.5983355 18.5
Maine Workplace -0.5939051 -30.0
New York Retail_Recreation -0.5899100 -46.0
Nebraska Workplace 0.5828335 -32.0
Idaho Residential -0.5780046 11.0
Utah Workplace -0.5583524 -37.0
Arizona Retail_Recreation -0.5563340 -42.5
New Jersey Workplace -0.5365495 -44.0
New York Parks 0.5294718 20.0
Connecticut Retail_Recreation -0.5246792 -45.0
Hawaii Residential -0.5206074 19.0
Connecticut Residential 0.5178760 14.0
Maine Parks 0.4904580 -31.0
New Jersey Grocery_Pharmacy -0.4841290 2.5
New Mexico Grocery_Pharmacy -0.4829717 -11.0
Connecticut Workplace -0.4822647 -39.0
Montana Parks -0.4790725 -58.0
Kansas Parks 0.4764310 72.0
Iowa Parks -0.4707521 28.5
West Virginia Parks 0.4693285 -33.0
Nebraska Residential -0.4654458 14.0
Arkansas Parks 0.4627176 -12.0
New Mexico Residential 0.4536815 13.5
Arizona Transit 0.4530030 -38.0
Maryland Workplace -0.4505981 -35.0
North Dakota Retail_Recreation -0.4489803 -42.0
Rhode Island Parks 0.4481230 52.0
New Jersey Retail_Recreation -0.4292765 -62.5
Illinois Transit -0.4286841 -31.0
Pennsylvania Workplace -0.4192739 -36.0
South Carolina Workplace 0.4119550 -30.0
Maryland Grocery_Pharmacy -0.4097685 -10.0
New Mexico Parks 0.4085476 -31.5
New Jersey Transit -0.4071358 -50.5
Montana Residential 0.4063500 14.0
Vermont Residential 0.4060174 11.5
New Hampshire Residential -0.3990217 14.0
Kentucky Parks -0.3988241 28.5
Pennsylvania Retail_Recreation -0.3931803 -45.0
Michigan Parks 0.3892118 28.5
Alabama Workplace -0.3867212 -29.0
Alabama Grocery_Pharmacy -0.3736090 -2.0
New Mexico Retail_Recreation -0.3713751 -42.5
New York Transit -0.3675468 -48.0
Oregon Parks -0.3599882 16.5
Wisconsin Transit -0.3452408 -23.5
Nebraska Grocery_Pharmacy 0.3417093 -0.5
Wyoming Transit -0.3387016 -17.0
Idaho Workplace -0.3329882 -29.0
Idaho Grocery_Pharmacy -0.3323841 -5.5
Arkansas Retail_Recreation -0.3271441 -30.0
North Dakota Workplace 0.3249741 -40.0
Maryland Retail_Recreation -0.3230199 -39.0
Missouri Residential -0.3185831 13.0
Virginia Transit -0.3147097 -33.0
Wyoming Workplace -0.3110734 -31.0
California Transit -0.3101593 -42.0
Wyoming Grocery_Pharmacy -0.3090524 -10.0
California Residential 0.3076345 14.0
Massachusetts Grocery_Pharmacy 0.3072704 -9.0
Maine Retail_Recreation -0.3050603 -42.0
Alaska Retail_Recreation 0.3036003 -39.0
Montana Transit 0.3014633 -41.0
California Parks -0.2974301 -38.5
Idaho Transit -0.2947824 -30.0
Nevada Residential 0.2929447 17.0
Colorado Residential 0.2913791 14.0
Illinois Workplace -0.2898271 -30.5
Minnesota Transit -0.2879126 -28.5
Florida Residential 0.2866363 14.0
Pennsylvania Parks 0.2795091 12.0
Vermont Retail_Recreation 0.2789361 -57.0
South Carolina Parks -0.2780452 -23.0
North Carolina Grocery_Pharmacy 0.2708759 0.0
Pennsylvania Grocery_Pharmacy -0.2650489 -6.0
North Carolina Workplace 0.2630552 -31.0
Mississippi Residential 0.2620199 13.0
Vermont Workplace -0.2603392 -43.0
Georgia Grocery_Pharmacy -0.2592861 -10.0
Kansas Workplace 0.2591200 -32.5
Rhode Island Grocery_Pharmacy 0.2579082 -7.5
Maryland Residential 0.2573521 15.0
Texas Workplace 0.2571485 -32.0
Alabama Transit -0.2548553 -36.5
Illinois Parks 0.2496676 26.5
Tennessee Workplace -0.2482566 -31.0
Rhode Island Transit -0.2479675 -56.0
Texas Residential -0.2452178 15.0
West Virginia Grocery_Pharmacy -0.2375160 -6.0
Florida Parks -0.2353336 -43.0
Oregon Workplace -0.2333565 -31.0
Tennessee Residential 0.2327839 11.5
New York Grocery_Pharmacy -0.2291796 8.0
Wisconsin Parks 0.2274697 51.5
Georgia Workplace -0.2273406 -33.5
Massachusetts Retail_Recreation 0.2227235 -47.0
Washington Grocery_Pharmacy 0.2224675 -7.0
Georgia Retail_Recreation -0.2183885 -41.0
South Dakota Workplace 0.2166878 -35.0
Minnesota Parks 0.2141057 -9.0
Hawaii Workplace 0.2124573 -46.0
North Carolina Transit 0.2084433 -32.0
Connecticut Parks 0.2059378 43.0
Nevada Retail_Recreation -0.2017080 -43.0
Missouri Workplace 0.2012827 -29.0
Mississippi Grocery_Pharmacy -0.1978285 -8.0
Kansas Grocery_Pharmacy -0.1962528 -14.0
Arizona Parks -0.1961619 -44.5
North Dakota Grocery_Pharmacy -0.1955758 -8.0
West Virginia Workplace 0.1943950 -33.0
Illinois Residential 0.1910841 14.0
Colorado Parks -0.1908112 2.0
North Carolina Residential 0.1906913 13.0
Nebraska Transit -0.1879742 -9.0
Idaho Retail_Recreation -0.1878367 -39.5
Indiana Parks -0.1863077 29.0
Virginia Residential 0.1843208 14.0
Kentucky Transit -0.1830046 -31.0
Nevada Workplace -0.1803301 -40.0
Wisconsin Residential -0.1769676 14.0
Kentucky Grocery_Pharmacy 0.1749879 4.0
Utah Retail_Recreation -0.1749425 -40.0
Texas Parks 0.1749424 -42.0
Tennessee Parks -0.1702143 10.5
Pennsylvania Transit -0.1677072 -42.0
Indiana Residential 0.1662973 12.0
Alabama Parks 0.1662263 -1.0
New Hampshire Retail_Recreation -0.1662103 -41.0
Massachusetts Parks 0.1626263 39.0
Ohio Transit 0.1589697 -28.0
New Jersey Residential 0.1583593 18.0
South Dakota Residential 0.1582299 15.0
New Mexico Transit 0.1516163 -38.5
Iowa Transit 0.1475981 -24.0
Massachusetts Workplace -0.1433116 -41.0
Montana Workplace -0.1425870 -40.0
Oregon Transit 0.1411028 -27.5
Oregon Grocery_Pharmacy -0.1401012 -7.0
Michigan Workplace -0.1392891 -40.0
Virginia Grocery_Pharmacy -0.1390971 -8.0
Minnesota Retail_Recreation 0.1370292 -40.5
North Dakota Transit 0.1358019 -48.0
Indiana Retail_Recreation 0.1353485 -38.0
North Dakota Residential -0.1350461 17.0
Arkansas Residential 0.1348793 12.0
Massachusetts Transit -0.1338011 -44.5
Missouri Transit -0.1325616 -24.5
California Grocery_Pharmacy -0.1324812 -11.5
Missouri Parks 0.1300110 0.0
Texas Grocery_Pharmacy 0.1261470 -14.0
Mississippi Retail_Recreation -0.1255429 -40.0
Wisconsin Workplace -0.1198871 -31.0
Wisconsin Grocery_Pharmacy 0.1187332 -1.0
California Retail_Recreation -0.1167047 -44.0
North Carolina Retail_Recreation 0.1164579 -34.0
Wyoming Residential 0.1161018 12.5
Kentucky Residential 0.1143899 12.0
Mississippi Transit -0.1139513 -38.5
Florida Retail_Recreation 0.1108376 -43.0
Utah Grocery_Pharmacy 0.1091474 -4.0
Nebraska Retail_Recreation 0.1071641 -36.0
West Virginia Residential -0.1044892 11.0
Idaho Parks 0.1043225 -22.0
Arkansas Workplace -0.1038753 -26.0
Oklahoma Grocery_Pharmacy -0.1037949 -0.5
Virginia Parks 0.1032969 6.0
Michigan Residential 0.1026348 15.0
Iowa Workplace 0.1017953 -30.0
New Hampshire Grocery_Pharmacy -0.1012035 -6.0
Maryland Transit -0.1009055 -39.0
Oklahoma Parks 0.1007072 -18.5
Mississippi Parks -0.1006947 -25.0
Massachusetts Residential 0.1004898 15.0
Kansas Transit -0.0990126 -26.5
Alabama Retail_Recreation 0.0987013 -39.0
Michigan Transit 0.0971822 -46.0
Michigan Retail_Recreation -0.0940604 -53.0
Minnesota Grocery_Pharmacy 0.0922169 -6.0
Ohio Residential 0.0920772 14.0
New York Residential 0.0911972 17.5
Nevada Parks 0.0904839 -12.5
Indiana Workplace 0.0897352 -34.0
Oklahoma Workplace 0.0872733 -31.0
Virginia Workplace -0.0872073 -31.5
Iowa Grocery_Pharmacy -0.0870891 4.0
Texas Transit 0.0855656 -41.0
Alabama Residential -0.0851937 11.0
South Dakota Transit -0.0850486 -40.0
Georgia Residential -0.0815967 13.0
Oregon Residential -0.0813376 10.5
Minnesota Residential -0.0805607 17.0
Ohio Parks -0.0788303 67.5
Georgia Parks 0.0782512 -6.0
Virginia Retail_Recreation -0.0761290 -35.0
West Virginia Transit -0.0761272 -45.0
Pennsylvania Residential 0.0760760 15.0
Montana Retail_Recreation 0.0746494 -50.0
Kentucky Retail_Recreation 0.0738107 -29.0
Ohio Grocery_Pharmacy 0.0724230 0.0
North Carolina Parks -0.0680156 7.0
Minnesota Workplace -0.0662518 -33.0
Washington Workplace -0.0630205 -38.0
Colorado Transit 0.0624111 -36.0
Maine Residential -0.0623655 11.0
Texas Retail_Recreation 0.0617432 -40.0
Florida Transit -0.0602137 -49.0
Vermont Transit -0.0579995 -63.0
Washington Transit -0.0578963 -33.5
Florida Grocery_Pharmacy 0.0560134 -14.0
California Workplace -0.0490548 -36.0
Ohio Retail_Recreation 0.0486627 -36.0
Kentucky Workplace -0.0485474 -36.5
Washington Residential 0.0481558 13.0
Indiana Transit 0.0467927 -29.0
New Hampshire Workplace 0.0466499 -37.0
South Dakota Retail_Recreation -0.0462126 -39.0
Georgia Transit -0.0440891 -35.0
Missouri Retail_Recreation -0.0421332 -36.0
Oregon Retail_Recreation 0.0417113 -40.5
Washington Parks 0.0415802 -3.5
Ohio Workplace -0.0379952 -35.0
Illinois Grocery_Pharmacy -0.0374895 2.0
Iowa Retail_Recreation 0.0374302 -38.0
Tennessee Transit 0.0319783 -32.0
South Carolina Grocery_Pharmacy 0.0304729 1.0
Arkansas Grocery_Pharmacy -0.0301344 3.0
Arizona Workplace -0.0295777 -35.0
Wisconsin Retail_Recreation 0.0294358 -44.0
Mississippi Workplace -0.0289465 -33.0
New Mexico Workplace 0.0286888 -34.0
Washington Retail_Recreation 0.0285138 -42.0
Maine Grocery_Pharmacy -0.0263840 -13.0
Colorado Grocery_Pharmacy -0.0258579 -17.0
West Virginia Retail_Recreation -0.0254522 -38.5
Michigan Grocery_Pharmacy -0.0222604 -11.0
Indiana Grocery_Pharmacy -0.0205219 -5.5
Colorado Retail_Recreation -0.0204671 -44.0
New Hampshire Transit 0.0196554 -57.0
Illinois Retail_Recreation 0.0193464 -40.0
Kansas Residential -0.0189517 13.0
Nebraska Parks 0.0184637 55.5
Tennessee Retail_Recreation -0.0180628 -30.0
South Carolina Residential -0.0180281 12.0
Colorado Workplace 0.0179386 -39.0
Tennessee Grocery_Pharmacy 0.0175384 6.0
Maryland Parks 0.0154708 27.0
Oklahoma Transit 0.0147160 -26.0
Iowa Residential -0.0130831 13.0
South Carolina Transit 0.0121961 -45.0
South Dakota Grocery_Pharmacy -0.0108775 -9.0
Nevada Grocery_Pharmacy -0.0104365 -12.5
Missouri Grocery_Pharmacy 0.0102389 2.0
Oklahoma Residential 0.0067745 15.0
Kansas Retail_Recreation -0.0063229 -38.0
Arkansas Transit 0.0059380 -27.0
South Carolina Retail_Recreation 0.0055522 -35.0
Wyoming Retail_Recreation 0.0032885 -39.0
Florida Workplace -0.0029392 -33.0
Oklahoma Retail_Recreation 0.0017241 -31.0
Montana Grocery_Pharmacy -0.0011870 -16.0
Alaska Parks NA 29.0
District of Columbia Retail_Recreation NA -69.0
District of Columbia Grocery_Pharmacy NA -28.0
District of Columbia Parks NA -65.0
District of Columbia Transit NA -69.0
District of Columbia Workplace NA -48.0
District of Columbia Residential NA 17.0
# sanity check
ggplot(filter(plot_data,Province.State %in% c("Pennsylvania","Maryland","New Jersey","California","Delaware","Connecticut")),aes(x=Total_confirmed_cases.per100,fill=variable))+geom_histogram()+
  facet_grid(~Province.State)+
    default_theme+
  theme(legend.position = "bottom")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

write_plot(mobility.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/mobility.plot.png"
write_plot(mobility.global.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/mobility.global.plot.png"
(plot_data.permobility_summary.plot<-ggplot(plot_data.permobility_summary,aes(x=variable,y=median_change))+
  geom_jitter(size=2,width=.2)+
  #geom_jitter(data=plot_data.permobility_summary %>% arrange(-abs(median_change)) %>% head(n=15),aes(col=Province.State),size=2,width=.2)+
  default_theme+
  ggtitle("Per-State Median Change in Mobility")+
  xlab("Mobility Measure")+
  ylab("Median Change from Baseline"))

write_plot(plot_data.permobility_summary.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/plot_data.permobility_summary.plot.png"

DELIVERABLE MANIFEST

The following links point to committed documents pushed to github. These are provided as a convenience, but note this is a manual process. The generation of reports, plots and tables is not coupled to the execution of this markdown.

Report

This report, html & pdf

Plots

github_root<-"https://github.com/sbs87/coronavirus/blob/master/"

plot_handle<-c("Corona_Cases.world.long.plot",
               "Corona_Cases.world.loglong.plot",
               "Corona_Cases.world.mortality.plot",
               "Corona_Cases.world.casecor.plot",
               "Corona_Cases.city.long.plot",
               "Corona_Cases.city.loglong.plot",
               "Corona_Cases.city.mortality.plot",
               "Corona_Cases.city.casecor.plot",
               "Corona_Cases.city.long.normalized.plot",
               "Corona_Cases.US_state.lm.plot",
               "Corona_Cases.US_state.summary.plot")


deliverable_manifest<-data.frame(
  name=c("World total & death cases, longitudinal",
         "World log total & death cases, longitudinal",
         "World mortality",
         "World total & death cases, correlation",
         "City total & death cases, longitudinal",
         "City log total & death cases, longitudinal",
         "City mortality",
         "City total & death cases, correlation",
         "City population normalized total & death cases, longitudinal",
         "State total cases (select) with linear model, longitudinal",
         "State total cases, longitudinal"),
  plot_handle=plot_handle,
  link=paste0(github_root,"results/",plot_handle,".png")
)


(tmp<-data.frame(row_out=apply(deliverable_manifest,MARGIN = 1,FUN = function(x) paste(x[1],x[2],x[3],sep=" | "))))
##                                                                                                                                                                                                        row_out
## 1                                           World total & death cases, longitudinal | Corona_Cases.world.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.long.plot.png
## 2                                 World log total & death cases, longitudinal | Corona_Cases.world.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.loglong.plot.png
## 3                                                         World mortality | Corona_Cases.world.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.mortality.plot.png
## 4                                      World total & death cases, correlation | Corona_Cases.world.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.casecor.plot.png
## 5                                              City total & death cases, longitudinal | Corona_Cases.city.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.plot.png
## 6                                    City log total & death cases, longitudinal | Corona_Cases.city.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.loglong.plot.png
## 7                                                            City mortality | Corona_Cases.city.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.mortality.plot.png
## 8                                         City total & death cases, correlation | Corona_Cases.city.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.casecor.plot.png
## 9  City population normalized total & death cases, longitudinal | Corona_Cases.city.long.normalized.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.normalized.plot.png
## 10                     State total cases (select) with linear model, longitudinal | Corona_Cases.US_state.lm.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.lm.plot.png
## 11                                      State total cases, longitudinal | Corona_Cases.US_state.summary.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.summary.plot.png
row_out<-apply(tmp, 2, paste, collapse="\t\n")
| name | handle | link |
| --- | --- | --- |
| World total & death cases, longitudinal | Corona_Cases.world.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.long.plot.png |
| World log total & death cases, longitudinal | Corona_Cases.world.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.loglong.plot.png |
| World mortality | Corona_Cases.world.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.mortality.plot.png |
| World total & death cases, correlation | Corona_Cases.world.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.casecor.plot.png |
| City total & death cases, longitudinal | Corona_Cases.city.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.plot.png |
| City log total & death cases, longitudinal | Corona_Cases.city.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.loglong.plot.png |
| City mortality | Corona_Cases.city.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.mortality.plot.png |
| City total & death cases, correlation | Corona_Cases.city.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.casecor.plot.png |
| City population normalized total & death cases, longitudinal | Corona_Cases.city.long.normalized.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.normalized.plot.png |
| State total cases (select) with linear model, longitudinal | Corona_Cases.US_state.lm.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.lm.plot.png |
| State total cases, longitudinal | Corona_Cases.US_state.summary.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.summary.plot.png |
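Rather than pasting the manifest columns together by hand, the same table can be rendered directly as markdown with `knitr::kable`. A minimal sketch, using a hypothetical two-row slice of the manifest (not the full `deliverable_manifest` above):

```r
library(knitr)

# Hypothetical two-row slice of the deliverable manifest
manifest <- data.frame(
  name = c("World total & death cases, longitudinal", "World mortality"),
  plot_handle = c("Corona_Cases.world.long.plot", "Corona_Cases.world.mortality.plot"),
  stringsAsFactors = FALSE
)
manifest$link <- paste0("https://github.com/sbs87/coronavirus/blob/master/results/",
                        manifest$plot_handle, ".png")

# kable() returns the markdown table lines; cat() prints them
cat(kable(manifest, format = "pipe"), sep = "\n")
```

Inside an R Markdown chunk, `kable(manifest)` on its own is enough; knitr picks the output format automatically.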

Tables

CONCLUSION

Overall, the trend of COVID-19 cases is no longer in the log-linear phase for the world or the U.S. (though some regions, like MD, still are). The mortality rate (deaths / confirmed RNA-based cases) is >1%, with a range depending on region. Mobility is not a strong indicator of caseload (U.S. data).

See table below for detailed breakdown.

| Question | Answer |
| --- | --- |
| What is the effect of social distancing and decreased mobility on case load? | There is no strong apparent effect of decreased mobility (work, grocery, retail) or increased mobility (at residence, parks) on the number of confirmed cases, either at the country (U.S.) or state level. California appears to have one of the better correlations, but the picture is mixed. |
| What is the trend in cases and mortality across geographical regions? | Confirmed total cases and mortality are overall log-linear for most countries, with a trailing off beginning for most (including the U.S.). At the state level, NY, NJ, and PA are starting to trail off; MD is still in the log-linear phase. Mortality and case load are highly correlated for NY, NJ, PA, and MD. The mortality rate fluctuates by region but is about 3% overall. |

END

End: ##—— Tue Jun 16 21:21:21 2020 ——##

Cheatsheet: http://rmarkdown.rstudio.com

Sandbox

# Geographical heatmap!
install.packages("maps")
library(maps)
library(dplyr)  # for %>% and select()

# Note: despite the "mi_" name, this loads Pennsylvania counties
mi_counties <- map_data("county", "pennsylvania") %>% 
  select(lon = long, lat, group, id = subregion)
head(mi_counties)

ggplot(mi_counties, aes(lon, lat)) + 
  geom_point(size = .25, show.legend = FALSE) +
  coord_quickmap()
mi_counties$cases <- seq_len(nrow(mi_counties))  # dummy case counts, one per polygon vertex
name_overlaps(metadata,Corona_Cases.US_state)

tmp<-merge(Corona_Cases.US_state,metadata)
ggplot(filter(tmp,Province.State=="Pennsylvania"), aes(Long, Lat, group = as.factor(City))) +
  geom_polygon(aes(fill = Total_confirmed_cases), colour = "grey50") + 
  coord_quickmap()


ggplot(Corona_Cases.US_state, aes(Long, Lat))+
  geom_polygon(aes(fill = Total_confirmed_cases ), color = "white")+
  scale_fill_viridis_c(option = "C")
dev.off()
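A minimal working version of the county choropleth being attempted above could look like the following. This is a sketch: the `cases` column is simulated with random values, since joining the real case table to polygon data requires a shared county key that the scratch code above does not establish.

```r
library(ggplot2)
library(maps)
library(dplyr)

# County polygons for Pennsylvania; one row per polygon vertex
pa <- map_data("county", "pennsylvania")

# Simulated per-county case counts (stand-in for real joined case data)
set.seed(42)
county_cases <- pa %>%
  distinct(subregion) %>%
  mutate(cases = sample(50:5000, n()))

pa <- left_join(pa, county_cases, by = "subregion")

# group = group keeps each county's ring as its own polygon
ggplot(pa, aes(long, lat, group = group, fill = cases)) +
  geom_polygon(colour = "grey50") +
  scale_fill_viridis_c(option = "C") +
  coord_quickmap()
```

The key detail the attempts above miss is the `group` aesthetic: without it, `geom_polygon` connects vertices across county boundaries.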


require(maps)
require(viridis)

world_map <- map_data("world")
ggplot(world_map, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill="lightgray", colour = "white")

head(world_map)
head(Corona_Cases.US_state)
unique(select(world_map, c("region", "group")))

some.eu.countries <- c(
  "US"
)
# Retrieve the map data
some.eu.maps <- map_data("world", region = some.eu.countries)

# Compute the centroid as the mean longitude and latitude
# Used as the label coordinate for each country's name
region.lab.data <- some.eu.maps %>%
  group_by(region) %>%
  summarise(long = mean(long), lat = mean(lat))

unique(filter(some.eu.maps,subregion %in% Corona_Cases.US_state$Province.State) %>% select(subregion))
unique(Corona_Cases.US_state$Total_confirmed_cases.log)
ggplot(filter(Corona_Cases.US_state, Date == "2020-04-17"), aes(x = Long, y = Lat)) +
  geom_polygon(aes( fill = Total_confirmed_cases.log))+
  #geom_text(aes(label = region), data = region.lab.data,  size = 3, hjust = 0.5)+
  #scale_fill_viridis_d()+
  #theme_void()+
  theme(legend.position = "none")
library("sf")
library("rnaturalearth")
library("rnaturalearthdata")

world <- ne_countries(scale = "medium", returnclass = "sf")
class(world)
ggplot(data = world) +
    geom_sf()

counties <- st_as_sf(map("county", plot = FALSE, fill = TRUE))
counties <- subset(counties, grepl("florida", counties$ID))
counties$area <- as.numeric(st_area(counties))
#install.packages("lwgeom")
class(counties)
head(counties)
ggplot(data = world) +
    geom_sf(data=Corona_Cases.US_state) +
    #geom_sf(data = counties, aes(fill = area)) +
  geom_sf(data = counties, aes(fill = area)) +
   # scale_fill_viridis_c(trans = "sqrt", alpha = .4) +
    coord_sf(xlim = c(-88, -78), ylim = c(24.5, 33), expand = FALSE)


head(counties)
tmp<-unique(select(filter(Corona_Cases.US_state,Date=="2020-04-17"),c(Lat,Long,Total_confirmed_cases.per100)))
st_as_sf(map("county", plot = FALSE, fill = TRUE))

# sf objects support the dplyr join verbs directly
dplyr::inner_join(Corona_Cases.US_state, counties)

library(sf)
library(sp)

nc <- st_read(system.file("shape/nc.shp", package="sf"))
class(nc)


spdf <- SpatialPointsDataFrame(coords = select(Corona_Cases.US_state,c("Lat","Long")), data = Corona_Cases.US_state,
                               proj4string = CRS("+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0"))

head(spdf)
class(spdf)
st_as_sf(spdf)  # convert the sp object to an sf object
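Alternatively, sf can build point geometry straight from the coordinate columns, skipping the sp step entirely. A sketch with a toy data frame standing in for the case table's `Long`/`Lat` columns:

```r
library(sf)

# Toy stand-in for the case table's coordinate columns
df <- data.frame(Long = c(-76.6, -75.2), Lat = c(39.3, 40.0), cases = c(10, 20))

# crs = 4326 is WGS84 longitude/latitude, matching the proj4string above
pts <- st_as_sf(df, coords = c("Long", "Lat"), crs = 4326)
```

`pts` can then be plotted with `geom_sf()` or joined to other sf layers.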

filter(Corona_Cases.US_state.summary,Date=="2020-04-20" & Province.State %in% top_states_modified)

https://stevenbsmith.net